perf(inference): streaming TRT optimizations and optional bucketed Flow engines #1894
Open
BeckYang26 wants to merge 2 commits into
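- Streaming inference: replace sleep polling with a Condition; the first-packet hop_len stays at 25
- Reuse the TRT context inside solve_euler; use wait_stream instead of a full device sync
- Fix how the CUDA Stream object is created in TrtContextWrapper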
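The Condition-based wake-up in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `TokenBuffer`, `stream_chunks`, and the hop values in the usage note are hypothetical, and `llm_job` here is only a stand-in producer that reuses the name from the PR. It also shows a local `hop_len`, so each request starts from the first-packet hop instead of inheriting mutated instance state.

```python
import threading
import time


class TokenBuffer:
    """Hypothetical per-request buffer shared by the LLM thread and the streaming loop."""

    def __init__(self):
        self.tokens = []
        self.done = False
        self.cond = threading.Condition()

    def put(self, token):
        # Producer side: append one token and wake the waiting consumer.
        with self.cond:
            self.tokens.append(token)
            self.cond.notify()

    def finish(self):
        # Producer side: signal completion so the consumer never waits forever.
        with self.cond:
            self.done = True
            self.cond.notify_all()

    def wait_for(self, n):
        # Consumer side: block until n tokens are buffered or the producer is done.
        # Returns True only if at least n tokens are available.
        with self.cond:
            self.cond.wait_for(lambda: len(self.tokens) >= n or self.done)
            return len(self.tokens) >= n


def llm_job(buf, n_tokens, delay=0.001):
    """Stand-in producer: emits integer tokens, notifying after each one."""
    for i in range(n_tokens):
        time.sleep(delay)
        buf.put(i)
    buf.finish()


def stream_chunks(buf, first_hop, hop):
    """Yield token chunks as soon as enough tokens have arrived."""
    hop_len = first_hop  # local variable, so every request starts from first_hop
    offset = 0
    while buf.wait_for(offset + hop_len):
        with buf.cond:
            chunk = buf.tokens[offset:offset + hop_len]
        yield chunk
        offset += hop_len
        hop_len = hop
    # Producer finished: flush whatever is left.
    with buf.cond:
        tail = buf.tokens[offset:]
    if tail:
        yield tail
```

With a producer thread running `llm_job(buf, 11)`, `list(stream_chunks(buf, first_hop=4, hop=3))` yields chunks of 4, 3, 3, and 1 tokens. The consumer wakes on `notify()` rather than on the next polling tick, which is where the saving over `time.sleep(0.1)` comes from.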
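- Add TrtBucketedContextWrapper, which routes by seq_len to one of four engine buckets: 256/768/1536/3000
- load_trt accepts a trt_bucket parameter; missing plans are built automatically from optimize.onnx
- get_trt_kwargs now matches the six inputs of export_onnx and supports per-bucket max_len profiles
- CosyVoice/CosyVoice2/3 gain a trt_bucket constructor parameter and prefer optimize.onnx
- flow_matching calls acquire_estimator(seq_len=...) to select the bucket at runtime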
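As a rough illustration of the seq_len routing described above, the sketch below picks the smallest bucket that can hold a given length. The bucket sizes come from the commit message; the function name `select_bucket`, the inclusive boundaries, and raising on over-length input are assumptions for illustration, not the behavior of `TrtBucketedContextWrapper`.

```python
from bisect import bisect_left

# Bucket upper bounds taken from the commit message.
BUCKETS = (256, 768, 1536, 3000)


def select_bucket(seq_len, buckets=BUCKETS):
    """Return the smallest bucket whose maximum length covers seq_len.

    Assumption: boundaries are inclusive, so seq_len == 256 maps to 256.
    """
    if seq_len < 1:
        raise ValueError(f"seq_len must be positive, got {seq_len}")
    idx = bisect_left(buckets, seq_len)
    if idx == len(buckets):
        # Assumption: lengths beyond the largest bucket are rejected.
        raise ValueError(f"seq_len {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[idx]
```

Under these assumptions a short streaming chunk of 200 frames maps to the 256 bucket, while a 1000-frame input maps to 1536, so short chunks avoid running on the engine built for the full 3000-length profile.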
BeckYang26
commented
May 22, 2026
    # FSQ silent and breath token
    self.silent_tokens = [1, 2, 28, 29, 55, 248, 494, 2241, 2242, 2322, 2323]
    self.condition_dict = {}
    self.first_token_hop_len = 25
Author
first_token_hop_len could be set smaller here: the first packet would arrive sooner, but audio quality may suffer.
Summary
This PR improves CosyVoice2/3 streaming TTS + vLLM + TensorRT Flow inference in two commits. Both are backward-compatible: no breaking API changes, no new environment variables in `cosyvoice/`, and CosyVoice1 / non-streaming paths are untouched.

Benchmark and test details: #1892
Commit 1: `perf(inference): reduce streaming latency and optimize TRT inference`

Streaming (CosyVoice2/3):

- Replace `time.sleep(0.1)` polling with `threading.Condition` wake-ups so the main thread proceeds as soon as tokens arrive (~100 ms polling waste removed).
- Update `llm_job()` to `notify()` after each token and on completion.
- Use a local `hop_len` in the streaming loop instead of mutating `self.token_hop_len`, so repeated streaming requests reset hop correctly.
- Define `first_token_hop_len` as a class constant (currently 25, aligned with training `chunk_size`); trigger threshold unchanged vs upstream.

TensorRT (Flow DiT):

- Reuse one TRT context across `solve_euler()` (was acquire/release per step).
- Use `wait_stream()` instead of device-wide `synchronize()` for finer overlap with vLLM.
- Fix `TrtContextWrapper` to store a real `torch.cuda.Stream` object.

Files: `cli/model.py`, `flow/flow_matching.py`, `utils/common.py`

Commit 2: `feat(inference): optional bucketed TensorRT engines for Flow DiT`

Motivation: The default single TRT plan is built for `max seq_len=3000`. Short streaming chunks still run on a large engine, increasing latency and VRAM per context.

Changes:

- Add a `trt_bucket` parameter to `AutoModel` / `CosyVoice*` / `load_trt()` (default `False`).
- Auto-build bucket plans under `{model_dir}/trt_bucket_plans/` when plans are missing.
- Add `TrtBucketedContextWrapper` for `seq_len`-aware routing via `acquire_estimator(seq_len=...)`.
- Prefer `flow.decoder.estimator.fp32.optimize.onnx`, fallback to official `fp32.onnx`.
- Update `get_trt_kwargs()` to use 6 ONNX inputs (`x`, `mask`, `mu`, `t`, `spks`, `cond`) with per-bucket profiles.

Files: `utils/file_utils.py`, `utils/common.py`, `cli/model.py`, `cli/cosyvoice.py`, `flow/flow_matching.py`

Compatibility
- `trt_bucket=False`
- `trt_bucket=True`: `trt_bucket_plans/`; does not read `*.mygpu.plan`
- `mygpu.plan`

Test plan

- `load_vllm=True` + `load_trt=True`, `stream=True`: first-packet latency vs upstream
- `stream=False` regression
- `trt_bucket=False`: single-plan path unchanged
- `trt_bucket=True` + `optimize.onnx`: auto-build 4 bucket plans on first run
- `seq_len`

Review guide
Use the Commits tab to review each commit independently: